Choosing LSI Dimensions by Document Linear Association Analysis
Abstract
Latent Semantic Indexing (LSI) has proven to be a valuable analysis tool with a wide range of applications. However, the crucial question of choosing an appropriate number of dimensions for LSI is still unsolved. In this paper, a new method for dealing with this problem is described. It finds that the sum of the dot products between all document vectors reaches its maximum value at a specific number of dimensions for a given dataset. With this number of reduced dimensions, LSI achieves its best performance. The performance evaluations demonstrate that this method can choose an appropriate number of dimensions for LSI and effectively detect the data structure of a dataset.
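The selection rule in the abstract can be sketched in a few lines of NumPy. This is only an illustration of the mechanics, not the authors' implementation: the function names `dot_product_sum_curve` and `choose_k`, the toy matrix, and above all the unit-length scaling of the document vectors are assumptions, since the abstract does not say how the vectors are scaled or which range of dimensions is searched.

```python
import numpy as np

def dot_product_sum_curve(A, ks):
    """For each k in ks, sum the pairwise dot products of k-dimensional LSI document vectors.

    A is a term-by-document matrix (rows: terms, columns: documents).
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    sums = []
    for k in ks:
        # Document vectors in the k-dimensional LSI space: columns of diag(s_k) @ Vt_k.
        D = s[:k, None] * Vt[:k, :]
        # ASSUMPTION (not stated in the abstract): scale each document vector to unit length.
        norms = np.linalg.norm(D, axis=0)
        D = D / np.where(norms > 0, norms, 1.0)
        G = D.T @ D  # Gram matrix: G[i, j] is the dot product of documents i and j
        # Sum over distinct document pairs i < j.
        sums.append((G.sum() - np.trace(G)) / 2.0)
    return np.array(sums)

def choose_k(A, ks):
    """Return the k in ks with the largest dot-product sum, plus the whole curve."""
    ks = list(ks)
    curve = dot_product_sum_curve(A, ks)
    return ks[int(np.argmax(curve))], curve

# Hypothetical toy data: 8 terms, 5 documents.
A = np.random.default_rng(0).random((8, 5))
ks = range(1, 6)
best_k, curve = choose_k(A, ks)
print(best_k, curve)
```

Under this particular normalization, a strictly positive matrix makes all document vectors collinear at k = 1, so that value scores highest by construction. A faithful reproduction would therefore need the paper's own definition of the document vectors and search range, which the abstract does not give.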
Similar resources
A Text Mining Model by Using Weighting Technology
Latent Semantic Indexing (LSI) has been proven to be a valuable analysis tool with a wide range of applications. However, choosing an appropriate number of dimensions for LSI is still a crucial challenge. This paper provides a document vector model that uses weighting technology to deal with this problem. Our experimental results have demonstrated that this model can detect a dataset structu...
LSI vs. Wordnet Ontology in Dimension Reduction for Information Retrieval
In the area of information retrieval, the dimension of document vectors plays an important role. Firstly, with higher dimensions index structures suffer the “curse of dimensionality” and their efficiency rapidly decreases. Secondly, we may not use exact words when looking for a document, thus we miss some relevant documents. LSI (Latent Semantic Indexing) is a numerical method, which discovers ...
A probabilistic model for Latent Semantic Indexing
Dimension reduction methods, such as Latent Semantic Indexing (LSI), when applied to semantic space built upon text collections, improve information retrieval, information filtering and word sense disambiguation. A new dual probability model based on the similarity concepts is introduced to provide deeper understanding of LSI. Semantic associations can be quantitatively characterized by their s...
Using Linear Algebra for Intelligent Information Retrieval
Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users' requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one ca...
Linear Discriminant Analysis in Document Classification
Document representation using the bag-of-words approach may require bringing the dimensionality of the representation down in order to be able to make effective use of various statistical classification methods. Latent Semantic Indexing (LSI) is one such method that is based on eigendecomposition of the covariance of the document-term matrix. Another often used approach is to select a small num...